Install a separate driver instance for each test#201
Conversation
|
|
||
| // verifyWebhook waits until the validating webhook is serving for the given | ||
| // DeviceClass by creating a dry-run ResourceClaim until it succeeds. | ||
| func verifyWebhook(ctx context.Context, deviceClassName string) { |
There was a problem hiding this comment.
This code is mostly moved from e2e_setup_test.go at line 107: https://github.com/willie-yao/dra-example-driver/blob/main/test/e2e/e2e_setup_test.go#L107
Changes: now takes a deviceClassNAme parameter, the test claim's Name is suffixed with it due to parallel webhook installs and added DeviceClass to error messages
| driverPods, err := clientset.CoreV1().Pods(cfg.Namespace).List(ctx, metav1.ListOptions{ | ||
| LabelSelector: driverPodSelector, | ||
| }) | ||
| if err != nil { | ||
| fmt.Fprintf(GinkgoWriter, "Failed to list driver pods: %v\n", err) | ||
| return | ||
| } | ||
| for _, pod := range driverPods.Items { | ||
| for _, c := range pod.Spec.Containers { | ||
| stream, err := clientset.CoreV1().Pods(cfg.Namespace).GetLogs(pod.Name, &corev1.PodLogOptions{ | ||
| Container: c.Name, | ||
| TailLines: &tailLines, | ||
| }).Stream(ctx) | ||
| if err != nil { | ||
| fmt.Fprintf(GinkgoWriter, | ||
| "Driver pod %s, container %s: failed to get logs: %v\n", | ||
| pod.Name, c.Name, err) | ||
| continue | ||
| } | ||
| buf := new(bytes.Buffer) | ||
| _, _ = io.Copy(buf, stream) | ||
| stream.Close() | ||
| fmt.Fprintf(GinkgoWriter, | ||
| "Driver pod %s, container %s (last %d lines):\n%s\n", | ||
| pod.Name, c.Name, tailLines, buf.String()) | ||
| } | ||
| } |
There was a problem hiding this comment.
This is moved from e2e_setup_test.go on line 192: https://github.com/willie-yao/dra-example-driver/blob/main/test/e2e/e2e_setup_test.go#L192
|
/assign @nojnhuh |
nojnhuh
left a comment
There was a problem hiding this comment.
Sorry for the annoying comments on the workspace/lint periphery here. I promise I'm starting to look at the main point of this PR.
| golangci-lint run ./... | ||
| cd test && golangci-lint run --build-tags=e2e ./... |
There was a problem hiding this comment.
This hides lint failures in the test module if any issues are found in the main module. Can we combine into one call?
| golangci-lint run ./... | |
| cd test && golangci-lint run --build-tags=e2e ./... | |
| golangci-lint run ./... | |
| golangci-lint run --build-tags=e2e ./... ./test/... |
There was a problem hiding this comment.
Tried combining it with golangci-lint run --build-tags=e2e ./... ./test/... but it silently doesn't lint the test module. It logs pattern ./test/...: directory prefix test does not contain main module or its selected dependencies and exits 0 with 0 issues. I'm thinking to do it by just creating another make target for test
.PHONY: lint lint-root lint-test
lint:
@$(MAKE) -k lint-root lint-test
lint-root:
golangci-lint run ./...
lint-test:
cd test && golangci-lint run --build-tags=e2e ./...
There was a problem hiding this comment.
With a workspace set up, it seems to work for me: willie-yao/dra-example-driver@separate-driver-install...nojnhuh:dra-example-driver:workspace-lint
% make lint
golangci-lint run --build-tags=e2e ./... ./test/...
cmd/dra-example-webhook/main.go:183:1: Comment should end in a period (godot)
// function
^
test/e2e/e2e_setup_test.go:54:1: Comment should end in a period (godot)
// driverPodSelector finds kubelet plugin Pods within an installed driver's release namespace
^
2 issues:
* godot: 2
make: *** [Makefile:83: lint] Error 1There was a problem hiding this comment.
Ah that was completely my bad. I had v0.0.0 set before when testing something. Setting it to v0.2.1 fixed it! Thanks for checking this for me
| k8s.io/api v0.36.0 | ||
| k8s.io/apimachinery v0.36.0 | ||
| k8s.io/client-go v0.36.0 | ||
| sigs.k8s.io/dra-example-driver v0.0.0 |
There was a problem hiding this comment.
go mod tidy seems to want to put v0.2.1 here if this doesn't already exist. That should still ultimately resolve to the local version though and not that actual tagged version.
There was a problem hiding this comment.
I confirmed locally that with replace => ../ present, v0.2.1 or any other version resolves to the local checkout. Should I change this to v0.2.1?
e5f08a6 to
27b36c3
Compare
| // runUpgradeOrInstall emulates `helm upgrade --install`: if a release with | ||
| // the configured name already exists it is upgraded in place, otherwise a | ||
| // fresh install is performed. | ||
| func runUpgradeOrInstall(ctx context.Context, actionCfg *action.Configuration, install *action.Install, chrt *chart.Chart, values map[string]any) { |
There was a problem hiding this comment.
In practice, will this ever actually perform an upgrade of an existing Helm release? If we expect each test to have a separate instance of the driver, then I think we'd want to error if that driver already exists instead of upgrade and risk affecting a different test.
The script used helm upgrade --install before mostly to be copy-pasteable, but that was always expected to first be run on a fresh kind cluster without the driver installed where a plain helm install should have worked.
There was a problem hiding this comment.
Ah you're right, this actually neds up hiding the errors and was leftover from trying to replicate helm upgrade --install. I'll remove it and use install.RunWithContext directly.
| actionCfg := newHelmActionConfig(cfg.Namespace, registryClient) | ||
|
|
||
| install := action.NewInstall(actionCfg) | ||
| install.ReleaseName = cfg.ReleaseName |
There was a problem hiding this comment.
Could we automatically generate a release name instead? That will make sure there aren't conflicts between tests, tests don't have to decide for themselves what the driver should be called, and we don't have to validate them up front.
| Expect(validation.IsDNS1123Label(cfg.ReleaseName)).To(BeEmpty(), | ||
| "DriverConfig.ReleaseName %q must be a valid DNS-1123 label", cfg.ReleaseName) | ||
| if cfg.Namespace == "" { | ||
| cfg.Namespace = cfg.ReleaseName |
There was a problem hiding this comment.
Like the release name, can we generate a new, separate namespace for each instance of the driver?
| // maxDriverNameLen is the longest driver name that fits within Linux's | ||
| // 108-byte UNIX_PATH_MAX after the kubelet appends its registrar socket | ||
| // prefix and per-pod UID suffix. | ||
| maxDriverNameLen = 28 |
There was a problem hiding this comment.
Can you expand on how this value was calculated?
There was a problem hiding this comment.
TBH copilot came up with this and I don't think it's necessary for the E2E test since we auto-generate default driver names that won't go over the limit anyways. I'll go ahead and remove it
There was a problem hiding this comment.
Does it simplify things if we construct these ResourceClaims inline vs. reading them from a file?
| // with the per-test driver name and (when set) extended resource name. | ||
| func substituteDriverIdentifiers(raw string, drv DriverConfig) string { | ||
| GinkgoHelper() | ||
| out := strings.ReplaceAll(raw, defaultDeviceClassName, drv.DriverName) |
There was a problem hiding this comment.
Eventually, the manifests will be referencing other drivers in addition to gpu.example.com. Some manifests may even rely on multiple separate instances of the driver running different profiles. Not a blocker for this PR, but we should find a better way to handle dynamic driver names.
|
FYI @willie-yao A couple of dependabot PRs merged which conflict with this PR. |
aa4b2eb to
833008d
Compare
nojnhuh
left a comment
There was a problem hiding this comment.
One more nit, otherwise LGTM.
Could you also check that all the mod/work/sum files are up to date?
go mod tidy && go -C test mod tidy
go work sync
go mod tidy && go -C test mod tidy # again
| Expect(cfg.ReleaseName).To(BeEmpty(), | ||
| "DriverConfig.ReleaseName is populated by installDriver; callers must not set it") | ||
| Expect(cfg.Namespace).To(BeEmpty(), | ||
| "DriverConfig.Namespace is populated by installDriver; callers must not set it") |
There was a problem hiding this comment.
Making these available to tweak and then checking to make sure they weren't tweaked isn't ideal. Could we set this in a separate local variable in installDriver and pass that around separately from the DriverConfig?
Having written this out, this sounds kind of insane and I'm open to the possibility that separate modules is going to be more pain going forward than it's worth. We could try sticking with one module for now and see if anybody runs into any issues. Splitting things up later will probably be easier than putting them back together. |
Yeah you're right. I noticed the go.work and go.sum files kept going out of date even when doing go mod tidy... I can go ahead and refactor into one module if you think that's a better way to go about things. |
7e93526 to
c6421b6
Compare
Yeah, I'm thinking let's go back to one module for now. Trying it here was still a worthwhile experiment nonetheless. |
nojnhuh
left a comment
There was a problem hiding this comment.
/lgtm
/approve
/hold in case we want to try to use Helm v4
| github.com/stretchr/testify v1.11.1 | ||
| github.com/urfave/cli/v2 v2.27.7 | ||
| google.golang.org/grpc v1.81.1 | ||
| helm.sh/helm/v3 v3.21.0 |
There was a problem hiding this comment.
There's a v4 for Helm which I don't think has too many changes from v3. Can we use that here? If it turns out to be more work, let's file an issue to update this since I'm not sure how long v3 will get security updates.
|
LGTM label has been added. DetailsGit tree hash: 4030de44291eece73f23b4bab4d9fda21c399085 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: nojnhuh, willie-yao The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@willie-yao Could you also please squash? |
e740038 to
a78fddd
Compare
Signed-off-by: William Yao <william2000yao@gmail.com>
a78fddd to
a85fb42
Compare
|
/lgtm |
|
LGTM label has been added. DetailsGit tree hash: 378fbf454ad4408d9c7d8b34703fe2bd4383b985 |
|
Added helm v4 |
Each
Itblock now installs its own DRA driver release with a unique driver name, and uninstalls it on teardown.Fixes #171
Changes:
test/e2e/driver_test.gowithinstallDriver(ctx, DriverConfig)helper. Shells out tohelm(already required bysetup-e2e.sh); honorsHELM_CHART_PATHenv var.deployManifestaccepts aDriverConfigand substitutesgpu.example.com-> per-test driver name (andexample.com/gpu-> per-test extended resource name) in demo manifests before applying.helm --wait, polls explicitly forDeviceClass+ResourceSlices+ webhook readiness before returning from install.DeferCleanupLIFO: driver-log diagnostics -> helm uninstall -> namespace delete with foreground propagation + termination wait.setup-e2e.shno longer installs the driver. Cert-manager + kind cluster + image build stay (cluster-wide infra).gpu.example.comand runOrdered, Serialso their static testdata stays valid.